The data is extracted from 1994 US census database and was found at the UCI ML repository: https://archive.ics.uci.edu/ml/datasets/Adult
We will try to analyze how different sociodemographical indicators affect the likelihood of a person earning more than 50,000$ a year.

Section 1: Exploration.

The comments about the chunk are given before the chunk.

First, let’s import the dataset and format it a bit for easier exploration.
Initialize, read file, assign column names.
Change actual NA values to proper NA object, and drop unused levels.

library(ggplot2)
library(dplyr)
library(reshape2)
library(scales)

adult <- read.csv('~/DataAnalyst/Projects/DataAnalystND_Project_4/adult/adult.data', header = F)
names(adult) <- c('age','workclass','fnlwgt','education','education_num',
                  'marital_status','occupation','relationship','race','sex',
                  'capital_gain','capital_loss','hours_per_week','native_country','income')
levels(adult$income) <- c('low','high')
adult[adult ==' ?'] <- NA
adult <- droplevels(adult)

#This addition to ggplot plots will set alpha of the legend to 100% for better readability.
fix_alpha <- guides(colour = guide_legend(override.aes = list(alpha = 1)))

Arrange education levels by the provided ‘education_num’ variable.
Arrange other factors by frequency of high salary(High Salary Ratio, HSR) - from lowest to highest.
Remove unneeded columns.
Group very low frequency workclass levels together.
Remove spaces from factor levels.

I’ll refer to high salary ratio of a group (number of people having high income divided by group size) as HSR.
We’ve stored HSR’s for variable levels for each variable in the adult_by[[variable]] list.
Let’s look how each variable affects the income:
age: HSR increases from 16 to 50, then declines. For some reason, ages 79 and 83 have very high HSR.
workclass: By far most people work in Private sector which also has the lowest HSR. Being Self-emp-inc (probably company owners) is paid very good, working for federal government is also paid well.
education: having high education definitely increases income.
marital_status: married-civ-spouse and married-af-spouse have the highest HSR.
About 1/3 of the respondents were never married, and they have the lowest HSR.
occupation: About 25% of the data are Prof-speciality and Exec-managerial - two highest HSR categories.
Priv-house-serv occupation has HSR of only 0.6%.
relationship: Wives and Husbands have very high HSR. Wives have even higher HSR than Husbands, despite that women have HSR of 11% and men of 30.5%.
race: Black and Native American have half as high HSR than White and Asian.
sex: For some reason, there are twice as many men as women in the survey, and for men the HSR is 3 times higher than HSR for women.
hours per week: 0 to 25 hpw - HSR decreases a bit, 25-60 - increases, and then decreases, probably because the top-paid executives don’t work long hours.
native_country: Caribbean and Latin-American have the lowest HSR. US-natives are somewhere in the middle, and the top of the list cannot be attributed to some region in particualar.
90% of the people are US-natives.

for (name in names(adult[,-11])) {
  print(adult_by[[name]])
}
## Source: local data frame [73 x 3]
## 
##      age high_salary_ratio     n
##    (int)             (dbl) (int)
## 1     17       0.000000000   395
## 2     18       0.000000000   550
## 3     20       0.000000000   753
## 4     82       0.000000000    12
## 5     85       0.000000000     3
## 6     86       0.000000000     1
## 7     87       0.000000000     1
## 8     88       0.000000000     3
## 9     19       0.002808989   712
## 10    21       0.004166667   720
## ..   ...               ...   ...
## Source: local data frame [8 x 3]
## 
##          workclass high_salary_ratio     n
##             (fctr)             (dbl) (int)
## 1           No_pay         0.0000000    21
## 2               NA         0.1040305  1836
## 3          Private         0.2186729 22696
## 4        State-gov         0.2719569  1298
## 5 Self-emp-not-inc         0.2849272  2541
## 6        Local-gov         0.2947922  2093
## 7      Federal-gov         0.3864583   960
## 8     Self-emp-inc         0.5573477  1116
## Source: local data frame [16 x 3]
## 
##       education high_salary_ratio     n
##          (fctr)             (dbl) (int)
## 1     Preschool        0.00000000    51
## 2       1st-4th        0.03571429   168
## 3       5th-6th        0.04804805   333
## 4          11th        0.05106383  1175
## 5           9th        0.05252918   514
## 6       7th-8th        0.06191950   646
## 7          10th        0.06645230   933
## 8          12th        0.07621247   433
## 9       HS-grad        0.15950862 10501
## 10 Some-college        0.19023454  7291
## 11   Assoc-acdm        0.24835989  1067
## 12    Assoc-voc        0.26121563  1382
## 13    Bachelors        0.41475257  5355
## 14      Masters        0.55658735  1723
## 15  Prof-school        0.73437500   576
## 16    Doctorate        0.74092010   413
## Source: local data frame [7 x 3]
## 
##          marital_status high_salary_ratio     n
##                  (fctr)             (dbl) (int)
## 1         Never-married        0.04596087 10683
## 2             Separated        0.06439024  1025
## 3 Married-spouse-absent        0.08133971   418
## 4               Widowed        0.08559919   993
## 5              Divorced        0.10420887  4443
## 6     Married-AF-spouse        0.43478261    23
## 7    Married-civ-spouse        0.44684829 14976
## Source: local data frame [15 x 3]
## 
##           occupation high_salary_ratio     n
##               (fctr)             (dbl) (int)
## 1    Priv-house-serv       0.006711409   149
## 2      Other-service       0.041578149  3295
## 3  Handlers-cleaners       0.062773723  1370
## 4                 NA       0.103635377  1843
## 5       Armed-Forces       0.111111111     9
## 6    Farming-fishing       0.115694165   994
## 7  Machine-op-inspct       0.124875125  2002
## 8       Adm-clerical       0.134482759  3770
## 9   Transport-moving       0.200375704  1597
## 10      Craft-repair       0.226640644  4099
## 11             Sales       0.269315068  3650
## 12      Tech-support       0.304956897   928
## 13   Protective-serv       0.325115562   649
## 14    Prof-specialty       0.449033816  4140
## 15   Exec-managerial       0.484013773  4066
## Source: local data frame [6 x 3]
## 
##     relationship high_salary_ratio     n
##           (fctr)             (dbl) (int)
## 1      Own-child        0.01322021  5068
## 2 Other-relative        0.03771662   981
## 3      Unmarried        0.06326175  3446
## 4  Not-in-family        0.10307044  8305
## 5        Husband        0.44857121 13193
## 6           Wife        0.47512755  1568
## Source: local data frame [5 x 3]
## 
##                 race high_salary_ratio     n
##               (fctr)             (dbl) (int)
## 1              Other        0.09225092   271
## 2 Amer-Indian-Eskimo        0.11575563   311
## 3              Black        0.12387964  3124
## 4              White        0.25585994 27816
## 5 Asian-Pac-Islander        0.26564004  1039
## Source: local data frame [2 x 3]
## 
##      sex high_salary_ratio     n
##   (fctr)             (dbl) (int)
## 1 Female         0.1094606 10771
## 2   Male         0.3057366 21790
## Source: local data frame [94 x 3]
## 
##    hours_per_week high_salary_ratio     n
##             (int)             (dbl) (int)
## 1              11                 0    11
## 2              19                 0    14
## 3              23                 0    21
## 4              31                 0     5
## 5              74                 0     1
## 6              77                 0     6
## 7              81                 0     3
## 8              82                 0     1
## 9              86                 0     2
## 10             87                 0     1
## ..            ...               ...   ...
## Source: local data frame [42 x 3]
## 
##                native_country high_salary_ratio     n
##                        (fctr)             (dbl) (int)
## 1          Holand-Netherlands        0.00000000     1
## 2  Outlying-US(Guam-USVI-etc)        0.00000000    14
## 3          Dominican-Republic        0.02857143    70
## 4                    Columbia        0.03389831    59
## 5                   Guatemala        0.04687500    64
## 6                      Mexico        0.05132193   643
## 7                   Nicaragua        0.05882353    34
## 8                        Peru        0.06451613    31
## 9                     Vietnam        0.07462687    67
## 10                   Honduras        0.07692308    13
## ..                        ...               ...   ...

We’ve talked about age, but the relationship is seen better in the graph.

ggplot(data = adult_by[['age']], aes(x = age, y = high_salary_ratio)) +
  geom_line()

Let’s break the age-hsr by education and sex.

adult_by_age_sex<- adult %>%
    group_by(age, sex) %>%
    summarise(high_salary_ratio = sum(income == 'high')/n(),
              n = n()) %>%
    arrange(age, sex)
    
ggplot(adult_by_age_sex, aes(x = age, y = high_salary_ratio, color = sex)) +
  geom_line()

adult_by_age_education<- adult %>%
    group_by(age, education) %>%
    summarise(high_salary_ratio = sum(income == 'high')/n(),
              n = n()) %>%
    arrange(age, education)

ggplot(adult_by_age_education, aes(x = age, y = high_salary_ratio, color = education)) +
  geom_line(size = 2)

ggplot(adult_by_age_education, aes(x = age, y = high_salary_ratio, color = education)) +
  geom_point(size = 5)

Women are younger than men.

ggplot(adult, aes(x = sex, y = age, fill = sex)) +
  geom_violin()

As we’ve seen, high-income people are older.

ggplot(adult, aes(x = income, y = age, fill = income)) +
  geom_violin()

The HPW distribution looks somewhat normal. Hard to say, but looks like most of the high income comes from high HPW.

ggplot(data = adult, aes(x = hours_per_week, fill = income)) +
  geom_histogram(binwidth = 2) +
  coord_cartesian(ylim = c(0,3000))

ggplot(data = adult, aes(x = hours_per_week, fill = income)) +
  geom_histogram(binwidth = 2, position = 'dodge') +
  coord_cartesian(ylim = c(0,3000))

Working about 60 hpw is paid best, working more - a little less.
It’s interesting that the slope of the curve around the 60 hpw is similar on both sides.

ggplot(data = adult_by[['hours_per_week']], aes(x = hours_per_week, y = high_salary_ratio)) +
  geom_line() +
  geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

Let’s see how much people with different education work. There’s a definite curve in the mean hpw (red diamonds).

ggplot(adult, aes(x = education, y = hours_per_week, fill = education)) +
  geom_boxplot() +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=18, size=3, show_guide = FALSE)

Most of the people are HS-grads, Some-college or Barchelors. Nothing particularly interesting about the races.

ggplot(data = adult, aes(x = education, fill = race)) +
  geom_histogram()

HSR for men with high education is very high.

ggplot(data = adult, aes(x = education, fill = income)) +
  geom_histogram() +
  facet_wrap(~sex, ncol = 1)

As we’ve already seen, people with better education are older. No surprise here.
People who have have average education have very high age difference between high and low income.

ggplot(data = adult, aes(x = education, y = age, fill = income)) +
  geom_boxplot()

We see that people of the lower education categories (from preschool to about 9th grade) are older than the subsequent categories.
That is probably because these people are most likely dropouts (and they could drop out a while ago), while people with 10th grade education or higher could be still studying (we have respondents of age 16+ in the survey).
Men with lower education are younger than women with same education, and the opposite is true for higher education.

ggplot(data = adult, aes(x = education, y = age, fill = sex)) + 
  geom_boxplot()

High-income people on average work 6.6 more hours_per_week than low-income.

by(adult$hours_per_week,adult$income, mean)
## adult$income: low
## [1] 38.84021
## -------------------------------------------------------- 
## adult$income: high
## [1] 45.47303
ggplot(data = adult, aes(x = income, y = hours_per_week)) +
  geom_boxplot() +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=18, size=3,show_guide = FALSE)

High-income people are on average 7.5 years older.

by(adult$age,adult$income, mean)
## adult$income: low
## [1] 36.78374
## -------------------------------------------------------- 
## adult$income: high
## [1] 44.24984
ggplot(data = adult, aes(x = income, y = age)) +
  geom_boxplot() +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=18, size=3,show_guide = FALSE)

Most of the husbands and wives(highest HSR relationships) are married-civ-spouse(highest HSR marital status).

ggplot(adult, aes(x = relationship, fill = marital_status)) +
  geom_histogram()

ggplot(adult, aes(x = marital_status, fill = relationship)) +
  geom_histogram()

As people become older, they tend to either marry or become widowed and divorce less:

ggplot(data = adult, aes(x = age, fill = marital_status)) +
  geom_histogram(binwidth = 1)

Most of the younger people are children.
As we’ve seen, number of husbands is much higher than number of wives.

ggplot(data = adult, aes(x = age, fill = relationship)) +
  geom_histogram(binwidth = 1)

Let’s break it by sexes.
There are way more married men than women.
The histogram for women is much more skewed to the left than for men.

ggplot(data = adult, aes(x = age, fill = relationship)) +
  geom_histogram(binwidth = 1) + 
  facet_wrap(~sex, ncol = 1)

We see that in Own-child and Husband/Wife relationships women are older, and in other relationships women are younger.

ggplot(data = adult, aes(x = relationship, y = age, fill = sex)) +
  geom_boxplot()

We have 2 male wives and 1 female husband. These are probably errors.

nrow(subset(adult, sex == 'Male' & relationship == 'Wife'))
## [1] 2
nrow(subset(adult, sex == 'Female' & relationship == 'Husband'))
## [1] 1

Older people and women work less.

ggplot(data = adult, aes(x = age, y = hours_per_week, color = sex)) +
  geom_jitter(alpha = 0.5) +
  fix_alpha

The majority of high-income people come from 2 highest-paid occupations.

ggplot(data = adult, aes(x = occupation, fill = income)) +
  geom_histogram()

ggplot(data = adult, aes(x = occupation, fill = income)) +
  geom_histogram(position = 'dodge')

The people in self-employed workclasses are mostly men.
NA category has very high women/men ratio.

ggplot(data = adult, aes(x = workclass, fill = sex)) +
  geom_bar(position = 'dodge')

The majority of the difference in male-female populations is due to ‘white’ race.
Black race has very high woman ratio.

ggplot(data = adult, aes(x = race, fill = sex)) +
  geom_histogram(position = 'dodge')

Let’s see which occupations are dominated by either sex.
Remeber, that the occupations are arranged by HSR.
The highest female ratio is in the lowest-paid occupation(‘Priv-house-serv’).
Other female occupations: Adm-clerical and Other-service.
Male occupation: Handlers-cleaners, Armed-forces, Transport-moving, Craft-repair, Protective-serv.
Pretty much as expected.

occupation_by_sex = adult %>%
  group_by(occupation) %>%
  summarise(female_ratio = sum(sex == 'Female')/n(),
            n = n())

ggplot(data = occupation_by_sex, aes(x = occupation, y = female_ratio)) +
  geom_bar(stat='identity')

The countries are ordered by HSR.
In both high-HSR and low-HSR countries there are countries with many men(Mexico, India) or women(Dominican Republic, Germany), and countries with more younger(Guatemala, Taiwan) or older(Puerto-Rico, Italy) people.
Thus, there is not much new information here.

ggplot(data = adult, aes(x = age, fill = sex)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~native_country, scales = 'free_y')

Section 2: Final Plots and Summary

Plot One

ggplot(data = adult, aes(x = age, fill = relationship)) +
  geom_histogram(binwidth = 1) + 
  facet_wrap(~sex, ncol = 1) +
  xlab('Age, years') +
  ylab('Number of Respondents') +
  ggtitle('Histogram of Ages, by Sex and relationship')

Description One

The distribution of women looks very different to the distribution of men.
There are less women than men and they are younger.
The histograms of all the relationships for men and women, except husband and wife, look the same.

Plot Two

ggplot(data = adult_by[['hours_per_week']], aes(x = hours_per_week, y = high_salary_ratio)) +
  geom_line() +
  geom_smooth() +
  xlab('Hours per week working') +
  ylab('High salary ratio (frequency of high-income people)') +
  ggtitle('Relation of income to working hours per week')
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

Description Two

The lowest HSR is at 0-25 hpw, and after that HSR rises up to about 60 hpw.
But after 60 hpw the average hsr decreases.
We see that people who work about 100 hours a week earn about the same ass people who work 40 hours.

Plot Three

ggplot(data = occupation_by_sex, aes(x = occupation, y = female_ratio)) +
  geom_bar(stat='identity') +
  scale_y_continuous(labels = percent) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab('Percentage of females in occupation') +
  xlab('Occupation') +
  ggtitle('Occupations by gender')

Description Three

Here we can see which professions are dominated by either gender.
The occupations are ordered by HSR (leftmost occupation is worst-paid).
The occupations that have more women than others: Priv-House-Serv(lowest-paid occupation), Adm-clerical and Other-service.
Male occupations: Handlers-cleaners, Armed-forces, Transport-moving, Craft-repair, Protective-serv.

Section 3: Reflection

The dataset has very interesting information that characterizes the respondents in different sociodemographical ways. Nevertheless, we can say with fair amount of certainty, that the dataset is not representative of the US population, for example by gender and race distributions.
In the dataset, women and black/native americans are paid much less than men and white/asian people. This does not necessarily imply discrimination, but is suspicious.
People originating from different countries differ in income, and most of the low-income countries are in Latin America. Most of the high-income countries are rich and developed, although it is surprising to see Cambodia, Yugoslavia and even Iran on the top of the list. Anyway, 90% of the people are US-natives, so this variable doesn’t give too much information.

The dataset mostly contains categorical variables, having more numeric variables would be interesting.
Mainly, if our dependent variable (income) was numeric, it would open more opportunities for exploration.
It would be also very interesting to have the same data for a different year to examine the trends.